Analysis on countries infected by coronavirus

Libraries used:
  • ggplot2
  • factoextra
  • FactoMineR
  • Hmisc
  • ggExtra
  • corrplot
  • plotly
  • forecast
  • d3heatmap
  • clValid
  • MASS
  • rpart
  • NbClust
  • rms

1. Introduction :

Our analysis covers the 49 countries most affected by Covid-19 as of the last count on 28/04/2020. The dataset gathers the number of confirmed cases, recovered cases and deaths for each country, along with indicators such as the GPP (global pandemic preparedness index) and the FPY (flights per year).

2. Data structure :

The dataset has 10 variables:
  • Confirmed cases
  • Country population
  • Density/km²
  • Median age
  • Beds per 1000
  • GPP : Global pandemic preparedness index
  • FPY : Flights per year
  • Country income
  • Recovered cases
  • Deaths
str(df)
## 'data.frame':    49 obs. of  10 variables:
##  $ infected                          : int  6801 15597 9455 16705 49906 97100 56714 18435 82877 7285 ...
##  $ Population..millions.             : int  25448054 89977564 164403451 9449847 11580989 212298492 37684705 19087731 1438373881 50788168 ...
##  $ Density..Km.                      : int  3 109 1265 47 383 25 4 26 153 46 ...
##  $ median.age                        : num  37.9 43.5 27.6 40.3 41.9 33.5 41.1 35.3 38.4 31.3 ...
##  $ beds...1000                       : num  3.8 7.6 0.8 11 6.2 2.2 2.7 2.2 4.2 1.5 ...
##  $ GPP...global.pandemic.preparedness: num  75.5 58.5 35 35.3 61 59.7 75.3 58.3 48.2 44.2 ...
##  $ Income                            : Factor w/ 3 levels "high","lower/middle",..: 1 1 2 3 1 3 1 1 3 3 ...
##  $ FPY..flights.per.year             : int  665384 130260 101383 31676 140674 832683 1475063 134661 4692008 358847 ...
##  $ recovered                         : int  5814 13180 177 3117 12211 40937 23814 9572 78586 1666 ...
##  $ deaths                            : int  94 596 175 97 7765 6761 3684 247 4637 324 ...

3. Summary Stats :

##     infected       Population..millions.  Density..Km.    median.age   
##  Min.   :   6193   Min.   :2.872e+06     Min.   :   3   Min.   :22.80  
##  1st Qu.:   9523   1st Qu.:1.020e+07     1st Qu.:  49   1st Qu.:30.50  
##  Median :  18435   Median :3.768e+07     Median : 109   Median :38.30  
##  Mean   :  68960   Mean   :1.177e+08     Mean   : 332   Mean   :36.54  
##  3rd Qu.:  49906   3rd Qu.:8.998e+07     3rd Qu.: 225   3rd Qu.:42.20  
##  Max.   :1165868   Max.   :1.438e+09     Max.   :8358   Max.   :48.40  
##   beds...1000     GPP...global.pandemic.preparedness          Income  
##  Min.   : 0.600   Min.   :35.00                      high        :28  
##  1st Qu.: 1.600   1st Qu.:46.50                      lower/middle: 6  
##  Median : 2.800   Median :55.40                      upper/middle:15  
##  Mean   : 3.718   Mean   :55.11                                       
##  3rd Qu.: 4.700   3rd Qu.:62.20                                       
##  Max.   :13.400   Max.   :83.50                                       
##  FPY..flights.per.year   recovered          deaths     
##  Min.   :    1421      Min.   :    32   Min.   :   12  
##  1st Qu.:  119148      1st Qu.:  1534   1st Qu.:  229  
##  Median :  254064      Median :  4326   Median :  664  
##  Mean   :  716587      Mean   : 21082   Mean   : 4891  
##  3rd Qu.:  772926      3rd Qu.: 13386   3rd Qu.: 3336  
##  Max.   :11354693      Max.   :175382   Max.   :66369

4. Exploratory Data Analysis of Confirmed cases, recovered and deaths :

a. Confirmed Cases :

[Figure: Confirmed cases. Axes: Number of cases, Country infected.]
Country Number of cases
US 1165868
Spain 247122
Italy 209328
United Kingdom 182260
France 168396
[Figure: Confirmed cases per Total Population. Axes: Ratio, Country infected.]
Country Confirmed cases per total population (%)
Qatar 0.5414
Spain 0.5286
Belgium 0.4309
Ireland 0.4297
US 0.3526

b. Recovered cases :

[Figure: Recovered cases. Axes: Number of cases, Country infected.]
Country Number of cases
US 175382
Germany 129000
Spain 117248
Italy 79914
China 78586
[Figure: Recovered cases per Confirmed cases (%). Axes: Ratio, Country infected.]
Country Recovered cases per confirmed cases (%)
China 94.8224
Australia 85.4874
Austria 84.5034
Switzerland 80.9229
Iran 79.3952

c. Death cases :

[Figure: Number of death cases. Axes: Number of cases, Country infected.]
Country Number of Cases
US 66369
Italy 28710
United Kingdom 28205
Spain 25100
France 24763
[Figure: Death cases per Confirmed Cases (%). Axes: Ratio, Country infected.]
Country Deaths per confirmed cases (%)
Belgium 15.5593
United Kingdom 15.4751
France 14.7052
Italy 13.7153
Netherlands 12.3315
[Figure: Death cases per Total Population (%). Axes: Ratio, Country infected.]
Country Deaths per total population (%)
Belgium 0.067
Spain 0.0537
Italy 0.0475
United Kingdom 0.0416
France 0.038
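The ratios reported in this section can be reproduced directly from the raw counts. The sketch below uses two rows of the `str(df)` output that appear to correspond to Belgium and China; this is an inference from the matching counts, since `str` does not print country names, and the short column names are illustrative rather than those of the original data frame.

```r
# Two rows taken from the str(df) output; the country labels are inferred.
df <- data.frame(
  country    = c("Belgium", "China"),
  infected   = c(49906, 82877),
  population = c(11580989, 1438373881),
  recovered  = c(12211, 78586),
  deaths     = c(7765, 4637)
)

# Ratios expressed as percentages, rounded to 4 decimals as in the tables
df$cases_per_pop     <- round(100 * df$infected  / df$population, 4)
df$recovered_per_inf <- round(100 * df$recovered / df$infected,   4)
df$deaths_per_inf    <- round(100 * df$deaths    / df$infected,   4)
df$deaths_per_pop    <- round(100 * df$deaths    / df$population, 4)

print(df[, c("country", "cases_per_pop", "recovered_per_inf",
             "deaths_per_inf", "deaths_per_pop")])
```

The Belgian and Chinese values obtained this way agree with the corresponding entries in the tables of this section.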

4. Exploratory Data Analysis of Categorical Data “Income” :

[Figure: Distribution by income. Axes: Income (high, lower/middle, upper/middle), Number of countries.]

High Upper/Middle Lower/Middle
28 15 6
  • As shown in the graph above, most of the countries hardest hit by the virus are high-income countries.

a. Number of Confirmed cases per Income :

[Figure: Distribution by income of confirmed cases. Axes: Income, Confirmed Cases.]
Total Cases per income :
High Upper/Middle Lower/Middle
2616591 695378 67079
Countries most affected / Income :
Country Number of cases Income
US 1165868 high
Spain 247122 high
Italy 209328 high
United Kingdom 182260 high
France 168396 high

b. Number of Recovered cases per Income :

[Figure: Distribution by income of recovered cases. Axes: Income, Recovered Cases.]
Total number of cases / Income :
High Upper/Middle Lower/Middle
706001 316200 10803
Countries with the most recovered cases / Income :
Country Number of cases Income
US 175382 high
Germany 129000 high
Spain 117248 high
Italy 79914 high
China 78586 upper/middle

c. Number of Death cases per Income :

[Figure: Distribution by income of death cases. Axes: Income, Death Cases.]
Total number of cases / Income :
High Upper/Middle Lower/Middle
208230 28677 2743

Countries with the most Death cases / Income :
Country Number of cases Income
US 66369 high
Italy 28710 high
United Kingdom 28205 high
Spain 25100 high
France 24763 high
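The per-income totals in this section are group sums. As a minimal sketch, the code below applies the same aggregation to the five rows of the recovered-cases table above. Because it sees only those five countries, its totals are smaller than the full-data totals reported in the text.

```r
# Top-5 rows of the recovered-cases table above
top <- data.frame(
  country   = c("US", "Germany", "Spain", "Italy", "China"),
  recovered = c(175382, 129000, 117248, 79914, 78586),
  income    = c("high", "high", "high", "high", "upper/middle")
)

# Sum of recovered cases within each income level
totals <- tapply(top$recovered, top$income, sum)
print(totals)
```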

5. Correlogram :

Correlation matrix shown in the correlogram (variables in plot order):
                                   infected deaths recovered    FPY    GPP median.age   beds Population Density
infected                              1.000  0.920     0.783  0.892  0.409      0.157 -0.005      0.136  -0.061
deaths                                0.920  1.000     0.750  0.749  0.489      0.257 -0.013      0.084  -0.070
recovered                             0.783  0.750     1.000  0.671  0.316      0.254  0.054      0.236  -0.096
FPY..flights.per.year                 0.892  0.749     0.671  1.000  0.381      0.084 -0.027      0.410  -0.060
GPP...global.pandemic.preparedness    0.409  0.489     0.316  0.381  1.000      0.530  0.076     -0.117   0.009
median.age                            0.157  0.257     0.254  0.084  0.530      1.000  0.684     -0.158   0.094
beds...1000                          -0.005 -0.013     0.054 -0.027  0.076      0.684  1.000     -0.114  -0.082
Population..millions.                 0.136  0.084     0.236  0.410 -0.117     -0.158 -0.114      1.000  -0.027
Density..Km.                         -0.061 -0.070    -0.096 -0.060  0.009      0.094 -0.082     -0.027   1.000
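A correlogram is a display of the pairwise Pearson correlation matrix of the numeric variables. The report does not show the call that produced it, so the following is only a sketch of how such a matrix can be computed, on synthetic data (the `toy` data frame and its values are invented for illustration).

```r
set.seed(1)
# Synthetic stand-in for the numeric columns (not the real data)
n <- 49
infected <- rexp(n, rate = 1 / 50000)
toy <- data.frame(
  infected   = infected,
  deaths     = 0.05 * infected + rnorm(n, sd = 500),
  median.age = rnorm(n, mean = 36, sd = 7)
)

# Pearson correlation matrix, rounded to 3 decimals as in the correlogram
cor_mat <- round(cor(toy), 3)
print(cor_mat)
```

A matrix like `cor_mat` can then be passed to a plotting function such as `corrplot::corrplot(cor_mat, method = "number")`, which is presumably how the figure was drawn, given that corrplot is in the library list.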

6. PCA ANALYSIS :

Calculate the PCA :
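The call used to compute the PCA is not shown in the report. The output format below suggests FactoMineR and factoextra, but that is an assumption. As a hedged base-R sketch on synthetic data, a PCA on standardized variables can be computed as follows.

```r
set.seed(42)
# Synthetic stand-in for the 9 numeric variables (not the real data)
X <- as.data.frame(matrix(rnorm(49 * 9), nrow = 49))
X$V2 <- X$V1 + rnorm(49, sd = 0.3)   # make two columns correlated

# PCA on centered and scaled variables (correlation-matrix PCA)
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Eigenvalues: variance carried by each principal component
eigenvalues <- pca$sdev^2
print(round(eigenvalues, 3))
```

Because the variables are standardized, the eigenvalues add up to the number of variables (9 here), which is also the case for the eigenvalue table in this section.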

Extraction of eigenvalues / variances of the main components :
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.78283850        42.031539                    42.03154
## Dim.2 1.87023532        20.780392                    62.81193
## Dim.3 1.06470437        11.830049                    74.64198
## Dim.4 0.97301185        10.811243                    85.45322
## Dim.5 0.65224698         7.247189                    92.70041
## Dim.6 0.35220916         3.913435                    96.61385
## Dim.7 0.17428476         1.936497                    98.55034
## Dim.8 0.11179442         1.242160                    99.79250
## Dim.9 0.01867464         0.207496                   100.00000
Visualize the eigenvalues. Shows the percentage of variances explained by each main axis:

Key results (cumulative variance, eigenvalues, bar plot): the first three principal components have eigenvalues greater than 1 and together explain 74.64% of the variation in the data; the fourth has an eigenvalue just below 1 (0.97) and brings the cumulative share to 85.45%. The bar plot shows that the eigenvalues start to form a straight line after the fourth principal component. To keep the results interpretable, we will use only the first two components, which together represent more than 60% (62.81%) of the variation in the data.
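The percentages in the eigenvalue table follow from the eigenvalues alone. The sketch below recomputes them from the values printed above and counts the components with an eigenvalue above 1 (the Kaiser criterion); the variable names are illustrative.

```r
# Eigenvalues copied from the table above
eig <- c(3.78283850, 1.87023532, 1.06470437, 0.97301185, 0.65224698,
         0.35220916, 0.17428476, 0.11179442, 0.01867464)

# Share of total variance per component, and running total
variance_percent   <- 100 * eig / sum(eig)
cumulative_percent <- cumsum(variance_percent)

# Kaiser criterion: components whose eigenvalue exceeds 1
n_kaiser <- sum(eig > 1)

print(round(cbind(eig, variance_percent, cumulative_percent), 2))
print(n_kaiser)
```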

Description of dimensions:
Description of dimension 1:
## $quanti
##                                    correlation      p.value
## infected                             0.9455062 1.525080e-24
## deaths                               0.9230268 3.976206e-21
## FPY..flights.per.year                0.8813770 6.442516e-17
## recovered                            0.8503451 1.061037e-14
## GPP...global.pandemic.preparedness   0.5802849 1.240577e-05
## median.age                           0.3624675 1.048339e-02
## 
## $quali
##                 R2      p.value
## clusters 0.8368592 7.743584e-19
## 
## $category
##                       Estimate      p.value
## clusters=clusters_1  7.2162854 1.527916e-13
## ds=high              0.8859782 4.086720e-02
## clusters=clusters_2 -2.6437500 2.255358e-02
## clusters=clusters_3 -4.5725354 4.329677e-06
## 
## attr(,"class")
## [1] "condes" "list "

The first principal component is strongly correlated with four of the original variables: it increases with the number of infected cases, deaths, recoveries and flights per year. This suggests that these four variables vary together; if one increases, the others tend to increase as well. This component can therefore be viewed as a measure of the scale of the outbreak together with air traffic, while global.pandemic.preparedness is only moderately correlated with it (0.58). The first principal component correlates most strongly with the infected cases; based on the correlation of 0.945, we could say that it is primarily a measure of the number of infected cases. One possible reading is that countries with high values had a lot of human contact (no quarantine, no state of emergency, etc.), whereas countries with low values took early precautions and respected quarantine.

Furthermore, to make more sense of this: flights per year is an index giving the number of flights of each country per year. This variable is nearly as highly correlated with the first component as the infected cases, which means that when it increases, the number of infected cases tends to increase too.

In addition, deaths are more strongly correlated with the infected cases than recoveries are, even though recoveries outnumber deaths. We can relate this to how quickly deaths followed infection: at first nothing could be done for infected people beyond putting them in care and giving them painkillers, with no treatment available, but after a couple of weeks the pandemic was better understood and temporary treatments were developed.

Description of dimension 2:
## $quanti
##                                    correlation      p.value
## median.age                           0.8617675 1.873946e-15
## beds...1000                          0.7746392 6.538558e-11
## GPP...global.pandemic.preparedness   0.4086202 3.557739e-03
## Population..millions.               -0.4962679 2.873127e-04
## 
## $quali
##           R2     p.value
## ds 0.2183344 0.003463241
## 
## $category
##                  Estimate     p.value
## ds=high          0.914282 0.001730232
## ds=lower/middle -0.844556 0.017074167
## 
## attr(,"class")
## [1] "condes" "list "

The second principal component increases mainly with two of the variables, median.age and beds...1000. This component can be viewed as a measure of how well equipped a country is in terms of hospital beds available for health care, together with the median age of its population. Furthermore, as stated in the variable description, the GPP represents, on a scale of 100, how prepared a country is if a pandemic gets out of hand. Unfortunately, this variable is not strongly correlated with the others, which suggests that even the highly prepared, top-ranked countries were not prepared enough for the COVID-19 pandemic; in other words, a high preparedness score is no substitute for taking further precautions.

Graph of individuals, colored according to the k-means clustering on infected cases. Similar individuals are grouped together:
## Too few points to calculate an ellipse

As we can see in the graph above, the US is considered a group by itself, and the other two groups overlap somewhat, probably because their numbers of infected cases are very close.
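The clustering call itself is not shown in the report. As a sketch under that caveat, k-means with three centers on standardized synthetic data can be run as below; the `toy` data frame, with one extreme point standing in for an outlying country, is invented for illustration.

```r
set.seed(123)
# Synthetic stand-in: 48 ordinary points plus one extreme outlier
toy <- data.frame(
  infected = c(rexp(48, rate = 1 / 30000), 1.2e6),
  deaths   = c(rexp(48, rate = 1 / 2000), 6.6e4)
)

# Standardize, then run k-means with several random starts
km <- kmeans(scale(toy), centers = 3, nstart = 25)

print(table(km$cluster))   # cluster sizes
```

When a cluster contains fewer than three points, a confidence ellipse cannot be drawn around it, which is probably why the plotting output above reports "Too few points to calculate an ellipse".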

Graph of variables, colored according to the k-means clustering of the variables. Positively correlated variables are on the same side of the graph; negatively correlated variables are on opposite sides:

Here we used k-means segmentation to divide the variables. The graph confirms our interpretation above: infected cases, deaths, recoveries and flights per year are highly correlated and fall in the same cluster, which is logical; the same goes for beds per 1000, median age and GPP. Density and population count are also grouped together, which seems natural because density is calculated from the population count in the first place. The correlation circle, however, tells a different story: these last two variables are negatively correlated, and the density per km² is poorly represented here.

Coordinates of variables on main axes:
##                                          Dim.1       Dim.2       Dim.3
## infected                            0.94550621 -0.15270460  0.06201992
## Population..millions.               0.22784880 -0.49626791 -0.39805463
## Density..Km.                       -0.08330142  0.08300299  0.74719579
## median.age                          0.36246752  0.86176749 -0.04798405
## beds...1000                         0.09999117  0.77463924 -0.46194778
## GPP...global.pandemic.preparedness  0.58028489  0.40862020  0.32171733
## FPY..flights.per.year               0.88137704 -0.27118185 -0.05511578
## recovered                           0.85034508 -0.08794865 -0.09747378
## deaths                              0.92302684 -0.05283089  0.11121919
##                                          Dim.4       Dim.5
## infected                           -0.08358150 -0.19939353
## Population..millions.               0.63119902  0.36268821
## Density..Km.                        0.62758265 -0.17823251
## median.age                          0.19137179  0.10429020
## beds...1000                         0.24777144 -0.22882491
## GPP...global.pandemic.preparedness -0.19354036  0.57542262
## FPY..flights.per.year               0.12535300  0.02682055
## recovered                           0.03524162 -0.19743798
## deaths                             -0.14603677 -0.12300428
Coordinates on axis 1:

The four variables with the highest coordinates are infected (0.945), deaths (0.923), FPY..flights.per.year (0.881) and recovered (0.850); recall that these variables are highly correlated with component 1.

Coordinates on axis 2:

The two variables with the highest coordinates are median.age (0.862) and beds...1000 (0.775), while GPP...global.pandemic.preparedness has a medium to low coordinate (0.409); recall that the first two variables are highly correlated with component 2. Population..millions. has a coordinate of -0.496, so it is negatively correlated with this component.
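In a PCA on standardized variables, the coordinate of a variable on a component equals its correlation with that component, which is why the coordinates above match the correlations in the dimension descriptions. The sketch below checks this identity with base R on synthetic data; the data frame `X` is invented for illustration.

```r
set.seed(7)
# Synthetic stand-in for the numeric variables (not the real data)
X <- data.frame(a = rnorm(49))
X$b <- X$a + rnorm(49, sd = 0.5)
X$c <- rnorm(49)

pca <- prcomp(X, scale. = TRUE)

# Coordinates: loadings scaled by the component standard deviations
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# Same quantity obtained as correlations between variables and scores
corr <- cor(X, pca$x)

print(round(coord, 3))
print(round(corr, 3))
```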

Cos2: quality of representation of the variables on the main axes:
##                                          Dim.1       Dim.2       Dim.3
## infected                           0.893982000 0.023318696 0.003846471
## Population..millions.              0.051915074 0.246281834 0.158447491
## Density..Km.                       0.006939126 0.006889496 0.558301550
## median.age                         0.131382705 0.742643211 0.002302469
## beds...1000                        0.009998235 0.600065955 0.213395748
## GPP...global.pandemic.preparedness 0.336730558 0.166970471 0.103502041
## FPY..flights.per.year              0.776825494 0.073539593 0.003037749
## recovered                          0.723086756 0.007734965 0.009501138
## deaths                             0.851978555 0.002791103 0.012369709
##                                          Dim.4        Dim.5
## infected                           0.006985866 0.0397577811
## Population..millions.              0.398412207 0.1315427406
## Density..Km.                       0.393859979 0.0317668275
## median.age                         0.036623161 0.0108764448
## beds...1000                        0.061390685 0.0523608389
## GPP...global.pandemic.preparedness 0.037457870 0.3311111941
## FPY..flights.per.year              0.015713375 0.0007193416
## recovered                          0.001241972 0.0389817574
## deaths                             0.021326737 0.0151300539

Quality of representation on axis 1:

The four variables with the highest quality of representation are infected (0.894), deaths (0.852), FPY..flights.per.year (0.777) and recovered (0.723); recall that these variables are highly correlated with component 1.

Quality of representation on axis 2:

The two variables with the highest quality of representation are median.age (0.743) and beds...1000 (0.600), while GPP...global.pandemic.preparedness has a low quality of representation (0.167); recall that the first two variables are highly correlated with component 2.
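The cos2 of a variable on an axis is the square of its coordinate on that axis. The sketch below verifies this for three variables, using coordinates copied from the coordinate table; the sum over Dim.1 and Dim.2 gives the quality of representation on the first factorial plane.

```r
# Dim.1 and Dim.2 coordinates copied from the coordinate table (three variables)
coord <- rbind(
  "infected"    = c(Dim.1 = 0.94550621, Dim.2 = -0.15270460),
  "median.age"  = c(Dim.1 = 0.36246752, Dim.2 =  0.86176749),
  "beds...1000" = c(Dim.1 = 0.09999117, Dim.2 =  0.77463924)
)

# cos2 is the squared coordinate
cos2 <- coord^2
print(round(cos2, 6))

# Quality of representation on the plane spanned by Dim.1 and Dim.2
print(round(rowSums(cos2), 3))
```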

Contribution of variables to main axes:
##                                         Dim.1      Dim.2      Dim.3      Dim.4
## infected                           23.6325711  1.2468322  0.3612713  0.7179631
## Population..millions.               1.3723841 13.1684944 14.8818297 40.9462851
## Density..Km.                        0.1834370  0.3683759 52.4372368 40.4784359
## median.age                          3.4731249 39.7085437  0.2162543  3.7638967
## beds...1000                         0.2643051 32.0850509 20.0427231  6.3093461
## GPP...global.pandemic.preparedness  8.9015314  8.9277787  9.7212000  3.8496828
## FPY..flights.per.year              20.5355183  3.9321037  0.2853139  1.6149212
## recovered                          19.1149254  0.4135825  0.8923734  0.1276420
## deaths                             22.5222027  0.1492381  1.1617975  2.1918270
##                                         Dim.5
## infected                            6.0955102
## Population..millions.              20.1676274
## Density..Km.                        4.8703679
## median.age                          1.6675347
## beds...1000                         8.0277626
## GPP...global.pandemic.preparedness 50.7646956
## FPY..flights.per.year               0.1102867
## recovered                           5.9765332
## deaths                              2.3196817

Contribution of variables on axis 1:

The four variables with the highest contributions are infected (23.63%), deaths (22.52%), FPY..flights.per.year (20.54%) and recovered (19.11%); recall that these variables are highly correlated with component 1.

Contribution of variables on axis 2:

The two variables with the highest contributions are median.age (39.71%) and beds...1000 (32.09%), while GPP...global.pandemic.preparedness has a low contribution (8.93%); recall that the first two variables are highly correlated with component 2.
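The contribution of a variable to an axis is its cos2 on that axis divided by the eigenvalue of the axis, expressed as a percentage. The sketch below recomputes a few contributions from the cos2 and eigenvalue tables; the short row label `GPP` is an abbreviation introduced here.

```r
# Eigenvalues of the first two components (from the eigenvalue table)
eig <- c(Dim.1 = 3.78283850, Dim.2 = 1.87023532)

# cos2 values copied from the cos2 table (three variables)
cos2 <- rbind(
  "infected"   = c(Dim.1 = 0.893982000, Dim.2 = 0.023318696),
  "median.age" = c(Dim.1 = 0.131382705, Dim.2 = 0.742643211),
  "GPP"        = c(Dim.1 = 0.336730558, Dim.2 = 0.166970471)
)

# Contribution (%) = 100 * cos2 / eigenvalue of the component
contrib <- 100 * sweep(cos2, 2, eig, `/`)
print(round(contrib, 2))
```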

INDIVIDUALS : quality and contribution:

Here we can remark that most of the countries with a high quality of representation on the first component are medium- and high-income countries. Going back to the analysis of the first dimension, we said that the infected variable is the most representative of it, and we related it to the flights-per-year variable. This makes sense: a country with a medium or high income normally has medium or large airports, and with them a medium to high number of flights per year. The same goes for the second dimension: the higher the income, the more health care a country can provide for its citizens.

7. Clustering :

To start our classification and segmentation, we first need to choose an algorithm and method that work well with our dataset. We will use the clValid package to identify the best clustering approach and the optimal number of clusters, comparing k-means, hierarchical and PAM clustering.

## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
## 
## Validation Measures:
##                                  3       4       5       6       7       8       9      10      11      12      13      14      15      16      17      18      19      20      21      22      23      24
##                                                                                                                                                                                                           
## hierarchical Connectivity   5.8579  9.7159 15.7774 23.3472 25.3472 27.2139 29.6806 34.6187 35.9270 40.8806 43.1306 50.6147 52.4480 62.4246 65.4742 67.4742 69.4504 74.3940 79.6794 80.6972 83.5234 86.4274
##              Dunn           0.9407  0.6527  0.3284  0.2399  0.2399  0.2399  0.2399  0.2569  0.2569  0.2569  0.2569  0.2569  0.2569  0.3082  0.3082  0.3082  0.3082  0.3313  0.4136  0.4136  0.4136  0.4136
##              Silhouette     0.5780  0.5108  0.3796  0.3335  0.3163  0.2712  0.2654  0.3428  0.3356  0.2981  0.2930  0.2953  0.2732  0.2564  0.2361  0.2232  0.2072  0.2147  0.2329  0.2252  0.2161  0.2049
## kmeans       Connectivity   5.8579  9.7159 14.9544 19.8425 21.8425 29.0917 31.5095 34.2929 37.2040 41.5643 42.8976 46.5369 55.0206 58.5651 64.9988 66.9988 68.9750 76.1782 80.3810 81.3988 85.1417 87.7345
##              Dunn           0.9407  0.6527  0.1241  0.1518  0.1518  0.2211  0.2044  0.2236  0.2367  0.3281  0.3281  0.3281  0.3205  0.3201  0.3201  0.3201  0.3201  0.4051  0.4254  0.4415  0.4528  0.4944
##              Silhouette     0.5780  0.5108  0.3527  0.3861  0.3708  0.3841  0.3620  0.3689  0.3640  0.3393  0.3283  0.3232  0.2810  0.2804  0.2740  0.2610  0.2508  0.2226  0.2312  0.2258  0.2076  0.2002
## pam          Connectivity  11.0655 14.5123 16.1496 24.1000 25.4290 35.6734 37.6734 44.7210 47.9127 49.2210 58.5310 59.4810 60.9155 63.1655 69.1079 70.9413 72.9413 79.6714 82.4976 84.4738 85.7167 87.4250
##              Dunn           0.1020  0.1020  0.1489  0.0916  0.2035  0.1914  0.2079  0.2079  0.2121  0.2121  0.1920  0.2434  0.2594  0.2733  0.2733  0.3300  0.3300  0.3613  0.3613  0.3955  0.4222  0.4944
##              Silhouette     0.3051  0.3305  0.3511  0.3998  0.4010  0.2945  0.2800  0.2584  0.2621  0.2739  0.2609  0.2573  0.2734  0.2710  0.2506  0.2539  0.2409  0.2338  0.2247  0.2220  0.2184  0.1984
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 5.8579 hierarchical 3       
## Dunn         0.9407 hierarchical 3       
## Silhouette   0.5780 hierarchical 3

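Connectivity measures how often neighbouring observations are placed in the same cluster (lower is better). The Silhouette width measures how well each observation fits its own cluster compared with the nearest other cluster (closer to 1 is better). The Dunn index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance (higher is better). All three measures are optimal for the hierarchical method with 3 clusters (k-means with 3 clusters obtains the same scores), so for now we will use 3 clusters and the ‘hierarchical’ method.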
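To make the Dunn index concrete, here is a minimal base-R sketch that computes it for a hierarchical clustering cut into three groups. The data are synthetic, the average linkage is an arbitrary choice made for the example, and `dunn_index` is a helper written here, not a clValid function.

```r
set.seed(10)
# Synthetic stand-in: three well-separated groups in two dimensions
X <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)

d  <- dist(scale(X))
hc <- hclust(d, method = "average")   # linkage choice is an assumption
cl <- cutree(hc, k = 3)

# Dunn index: smallest between-cluster distance / largest within-cluster distance
dunn_index <- function(d, cl) {
  m    <- as.matrix(d)
  same <- outer(cl, cl, "==")
  off  <- upper.tri(m)                # each pair counted once, no diagonal
  min(m[off & !same]) / max(m[off & same])
}

print(table(cl))
print(dunn_index(d, cl))
```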

To verify our choice, we will check it against the 26 criteria:

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 3 as the best number of clusters 
## * 5 proposed 4 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 3 proposed 8 as the best number of clusters 
## * 3 proposed 9 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 9 proposed  3 as the best number of clusters
## * 5 proposed  4 as the best number of clusters
## * 3 proposed  5 as the best number of clusters
## * 1 proposed  7 as the best number of clusters
## * 3 proposed  8 as the best number of clusters
## * 3 proposed  9 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

The histogram above shows that the majority of the criteria proposed 3 clusters.
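The majority rule amounts to picking the number of clusters with the most votes. A small sketch using the vote counts printed in the output above:

```r
# Votes copied from the output above: number of clusters -> number of indices
votes <- c("0" = 2, "3" = 9, "4" = 5, "5" = 3, "7" = 1, "8" = 3, "9" = 3)

# Number of clusters receiving the most votes
best_k <- as.integer(names(votes)[which.max(votes)])

print(sum(votes))   # total number of indices that voted
print(best_k)       # number of clusters chosen by the majority rule
```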

The dendrogram:

The dendrogram above tells the same story as the results from the clValid package and the PCA individuals plot clustered by k-means: three major clusters are present. Relating the classification results to the PCA results, we can say that the blue cluster (right group) contains the lower/middle and low income countries; these countries have the lowest numbers of infected cases recorded, which, as we said before, is consistent with their low number of flights. The green cluster (middle group) contains the upper/middle and high income countries, which have a medium to high or high number of flights per year. The last cluster, the red one, is a single individual: the US, which has the highest number of flights per year and is now recording the highest number of infected cases, about a third (1,165,868 of 3,379,048, or 34.5%) of all infected cases in the dataset.

ADDING AN INDEX:

In this part we will classify the countries by a risk indicator called ‘state’, taking the values 0 and 1 for low and high risk respectively, which we create as follows:

If the total deaths are more than 3% of the total infected cases, the country takes class 1, the highest, meaning the situation is very serious.

If the total deaths are less than 3% of the total infected cases, the country takes class 0, the lowest, meaning the general situation is under control.

Here we can see that 30 countries are at high risk, seriously endangered by this pandemic, and 19 countries have it under control.
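The rule can be written in a couple of lines. The sketch below applies it to the first ten rows visible in the `str(df)` output; the treatment of a ratio of exactly 3% is an assumption, since the text does not cover that case.

```r
# First ten rows of the str(df) output (infected and deaths columns)
infected <- c(6801, 15597, 9455, 16705, 49906, 97100, 56714, 18435, 82877, 7285)
deaths   <- c(94, 596, 175, 97, 7765, 6761, 3684, 247, 4637, 324)

death_rate <- deaths / infected

# 1 = high risk (deaths above 3% of confirmed cases), 0 = low risk.
# A rate of exactly 3% is assigned to class 0 here (assumption).
state <- as.integer(death_rate > 0.03)

print(round(100 * death_rate, 2))
print(table(state))
```

Among these ten countries, six fall in class 1 and four in class 0.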

DECISION TREE:

In this section we will not use a random forest, because we do not have enough individuals to grow that many distinct trees (the default in R is 500); instead we will use a decision tree:

The results here are very interesting:

A tree structure is well suited to capturing interactions between features in the data. The tree tells us that if a country has fewer than 302 deaths and a median age under 41, the risk indicator is ‘LOW’, which means that the country should be able to control the situation. If the median age is above 41, the country has a 6% chance of losing control, which suggests that infected people in older populations face a greater risk of dying.

Turning to the right side of the tree: if a country has more than 302 deaths, we consider another indicator, the global pandemic preparedness. If this is above 54%, there is a high chance that the country is at high risk from this pandemic and its risk indicator is ‘HIGH’ (1), which is not very logical.

If a country is prepared for a pandemic, why would it be at risk?

This confirms the results from the PCA. This illogical result means two things:

First, this pandemic was beyond expectations, which is a fact.

Second, the countries that thought they were prepared, did not follow the instructions for lockdown and quarantine, and relied on their general health care indicators (beds per 1000, global pandemic preparedness, rehabilitation beds) were hit hard and recorded the highest numbers of infected cases; the US is the best example of this.

Furthermore, if a country has a GPP under 54% and more than 4682 recoveries, it is partially okay but still endangered. If it has fewer than 4682 recoveries, it is definitely at risk.
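The call used to grow the tree is not shown. Since rpart is in the library list, a plausible sketch (not the original code) of fitting and applying a classification tree is given below. The data and the labelling rule are synthetic, and lowering `minsplit` is a choice made here because the sample is small.

```r
library(rpart)   # recommended package shipped with standard R installations

set.seed(2020)
# Synthetic stand-in for the country table (not the real data)
n   <- 49
toy <- data.frame(
  deaths     = round(rexp(n, rate = 1 / 3000)),
  median.age = rnorm(n, mean = 36, sd = 7),
  GPP        = runif(n, min = 35, max = 85),
  recovered  = round(rexp(n, rate = 1 / 15000))
)
# Hypothetical label loosely tied to deaths, for illustration only
toy$state <- factor(ifelse(toy$deaths > 300 & runif(n) > 0.2, "HIGH", "LOW"))

# Classification tree; minsplit lowered because the sample is small
fit <- rpart(state ~ ., data = toy, method = "class",
             control = rpart.control(minsplit = 10))

# Predicted class for each country, compared with the original label
pred <- predict(fit, newdata = toy, type = "class")
print(table(ORIGINAL = toy$state, PREDICTED = pred))
```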

##         PREDICTED
## ORIGINAL HIGH LOW
##     HIGH   26   4
##     LOW     0  19

As we can see above, the confusion matrix shows that only 4 individuals are misclassified.

## [1] "THE PRECISION OF THE CORRECT PLACED PREDICTIONS IS :"
## [1] 0.9183673
## [1] "THE PRECISION OF THE ERRORS IS :"
## [1] 0.08163265

Here the overall accuracy (labelled "precision" in the output above) is high, about 91.8%, which is very good.
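The two figures printed above are the overall accuracy and error rate of the confusion matrix. The sketch below recomputes them from the matrix and also derives precision and recall for the HIGH class, which are separate quantities; the variable names are illustrative.

```r
# Confusion matrix copied from the output above
cm <- matrix(c(26, 4,
                0, 19),
             nrow = 2, byrow = TRUE,
             dimnames = list(ORIGINAL  = c("HIGH", "LOW"),
                             PREDICTED = c("HIGH", "LOW")))

accuracy   <- sum(diag(cm)) / sum(cm)   # share of correct predictions
error_rate <- 1 - accuracy

# Precision and recall for the HIGH class
precision_high <- cm["HIGH", "HIGH"] / sum(cm[, "HIGH"])
recall_high    <- cm["HIGH", "HIGH"] / sum(cm["HIGH", ])

print(c(accuracy = accuracy, error = error_rate,
        precision_high = precision_high, recall_high = recall_high))
```

All 4 errors are HIGH countries predicted as LOW, so precision for HIGH is 1 while recall for HIGH is 26/30.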

logistic regression:

In this part we will fit a logistic regression and use the automatic ‘BACKWARD’ and ‘FORWARD’ selection methods together, through the ‘BOTH’ direction of the stepAIC function. Furthermore, we will make a small modification to our state index: we divide the total active cases by the total infected cases and round the result, so that it is either 1 or 0, where 1 is high risk and 0 is low risk. We also remove the infected column to avoid overfitting.
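
A minimal sketch of this procedure, on synthetic data rather than the real table. It assumes that active cases are computed as infected minus recovered minus deaths (the text does not spell this out); the data frame toy and its shortened column names are hypothetical.

```r
library(MASS)  # provides stepAIC()

set.seed(2020)
n <- 49

# Synthetic stand-in for the real data; every value here is made up
infected  <- round(runif(n, 5000, 100000))
recovered <- round(infected * runif(n, 0.05, 0.85))
deaths    <- round(infected * runif(n, 0.01, 0.10))
toy <- data.frame(
  recovered  = recovered,
  deaths     = deaths,
  GPP        = runif(n, 30, 85),
  median.age = runif(n, 25, 48)
)

# Assumption: active = infected - recovered - deaths
active <- infected - recovered - deaths

# State index: share of active cases, rounded to 0 (low risk) or 1 (high risk)
toy$state <- round(active / infected)

# 'infected' is deliberately left out of the predictors, as in the text
null_model <- glm(state ~ 1, family = binomial, data = toy)

# Stepwise selection in both directions, starting from the intercept-only model
final_model <- stepAIC(
  null_model,
  scope = list(lower = ~ 1,
               upper = ~ recovered + deaths + GPP + median.age),
  direction = "both",
  trace = FALSE
)
print(formula(final_model))
```

Starting from the intercept-only model matches the trace below, which begins at state ~ 1; setting trace = TRUE prints a step-by-step table of the same form.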

Here are the results:

## Start:  AIC=60.63
## state ~ 1
## 
##                                      Df Deviance    AIC
## + recovered                           1   53.213 57.213
## + GPP...global.pandemic.preparedness  1   56.354 60.354
## <none>                                    58.630 60.630
## + median.age                          1   56.999 60.999
## + Density..Km.                        1   57.427 61.427
## + ds                                  1   57.966 61.966
## + Population..millions.               1   58.444 62.444
## + deaths                              1   58.560 62.560
## + beds...1000                         1   58.599 62.599
## + FPY..flights.per.year               1   58.605 62.605
## 
## Step:  AIC=57.21
## state ~ recovered
## 
##                                      Df Deviance    AIC
## + deaths                              1   45.445 51.445
## + FPY..flights.per.year               1   49.233 55.233
## <none>                                    53.213 57.213
## + GPP...global.pandemic.preparedness  1   52.442 58.442
## + Density..Km.                        1   52.523 58.523
## + median.age                          1   52.719 58.719
## + ds                                  1   52.747 58.747
## + beds...1000                         1   53.093 59.093
## + Population..millions.               1   53.196 59.196
## - recovered                           1   58.630 60.630
## 
## Step:  AIC=51.45
## state ~ recovered + deaths
## 
##                                      Df Deviance    AIC
## + GPP...global.pandemic.preparedness  1   40.582 48.582
## + ds                                  1   41.626 49.626
## <none>                                    45.445 51.445
## + median.age                          1   44.174 52.174
## + Population..millions.               1   44.393 52.393
## + FPY..flights.per.year               1   44.548 52.548
## + Density..Km.                        1   44.981 52.981
## + beds...1000                         1   45.337 53.337
## - deaths                              1   53.213 57.213
## - recovered                           1   58.560 62.560
## 
## Step:  AIC=48.58
## state ~ recovered + deaths + GPP...global.pandemic.preparedness
## 
##                                      Df Deviance    AIC
## + FPY..flights.per.year               1   38.547 48.547
## <none>                                    40.582 48.582
## + ds                                  1   39.226 49.226
## + Population..millions.               1   40.153 50.153
## + Density..Km.                        1   40.205 50.205
## + beds...1000                         1   40.379 50.379
## + median.age                          1   40.547 50.547
## - GPP...global.pandemic.preparedness  1   45.445 51.445
## - deaths                              1   52.442 58.442
## - recovered                           1   56.040 62.040
## 
## Step:  AIC=48.55
## state ~ recovered + deaths + GPP...global.pandemic.preparedness + 
##     FPY..flights.per.year
## 
##                                      Df Deviance    AIC
## <none>                                    38.547 48.547
## - FPY..flights.per.year               1   40.582 48.582
## + ds                                  1   37.422 49.422
## + Density..Km.                        1   38.185 50.185
## + beds...1000                         1   38.301 50.301
## + median.age                          1   38.514 50.514
## + Population..millions.               1   38.533 50.533
## - GPP...global.pandemic.preparedness  1   44.548 52.548
## - deaths                              1   47.035 55.035
## - recovered                           1   56.031 64.031
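
The AIC column of the trace can be checked by hand. For a 0/1 response the saturated log-likelihood is zero, so the AIC of a binomial glm equals its residual deviance plus twice the number of estimated coefficients. The sketch below copies the deviances of the models retained at each step from the trace above; the variable names are illustrative.

```r
# Residual deviance of the model kept at each step (copied from the trace)
deviance_by_step <- c(58.630, 53.213, 45.445, 40.582, 38.547)

# Number of estimated coefficients: intercept only, then one more per step
n_coef <- 1:5

# AIC = deviance + 2 * number of coefficients (valid for a 0/1 response)
aic_by_step <- deviance_by_step + 2 * n_coef
print(aic_by_step)

# Improvement brought by each added variable
print(diff(aic_by_step))
```

The last difference is very small: adding flights per year lowers the AIC only from 48.582 to 48.547.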

This is the final model selected by the function:

## 
## Call:
## glm(formula = state ~ recovered + deaths + GPP...global.pandemic.preparedness + 
##     FPY..flights.per.year, family = binomial, data = (newdf))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8705  -0.1346   0.3453   0.6051   2.2430  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                         6.949e+00  2.613e+00   2.660  0.00782 **
## recovered                          -9.497e-05  3.510e-05  -2.706  0.00682 **
## deaths                              2.168e-04  9.524e-05   2.276  0.02283 * 
## GPP...global.pandemic.preparedness -9.647e-02  4.370e-02  -2.208  0.02727 * 
## FPY..flights.per.year               6.266e-07  5.219e-07   1.201  0.22991   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 58.630  on 48  degrees of freedom
## Residual deviance: 38.547  on 44  degrees of freedom
## AIC: 48.547
## 
## Number of Fisher Scoring iterations: 6

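As we can see from the results above, most of the p-values are significant at the 5% level: recovered, deaths and GPP are significant, while flights per year is not (p = 0.23).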
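
To read the coefficients more easily, they can be converted to odds ratios, and the Wald z values and p-values can be recomputed from the estimates and standard errors. The sketch below re-enters the numbers from the summary by hand; the shortened names GPP and FPY are illustrative.

```r
# Estimates and standard errors copied from the summary above
est <- c(recovered = -9.497e-05, deaths = 2.168e-04,
         GPP = -9.647e-02, FPY = 6.266e-07)
se  <- c(recovered = 3.510e-05, deaths = 9.524e-05,
         GPP = 4.370e-02, FPY = 5.219e-07)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(est)

# Odds ratio for a 10-point increase in the GPP index
or_gpp_10 <- exp(10 * est["GPP"])

# Wald z statistics and two-sided p-values
z <- est / se
p <- 2 * pnorm(-abs(z))

print(round(cbind(odds_ratio = odds_ratios, z = z, p = p), 4))
```

In this model a higher GPP lowers the odds of being in the high-risk state, all else equal, whereas the per-unit effects of recovered, deaths and FPY look tiny only because those variables are raw counts.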

In this section we look at the accuracy of the predictions made by the selected model:

## [1] "THE ACCURACY OF THE CORRECT PLACED PREDICTIONS IS :"
## [1] 0.8461538
## [1] "THE ACCURACY OF THE ERRORS IS :"
## [1] 0.1538462

We have a good accuracy (about 85%)!

Summary of interpretations:

The number one factor in the evolution of infected cases is the flights per year indicator!

Example: the USA has the highest number of flights per year and the highest number of infected cases.

The GPP (global pandemic preparedness) indicator does not explain the percentage of recoveries.

The general health indicators (see the PCA results) did not explain whether a country could be effective against the virus or not!

The number of deaths increases when a country has a median age above 41.